# Preparing a Perturbation

Before perturbing the network, we prepare the graph in two steps:
- Constructing Randomized Inward-Links for Fake Basal Species
- Pruning the Network Based on Optimal Foraging Theory

## Constructing Randomized Inward-Links for Fake Basal Species

This process creates randomized inward-links for species that falsely appear to be basal because of missing data. These species do not hold a true basal role, but they have no inward-edges in the network representation, so the network misleadingly presents them as basal species.

The newly created links will provide a more realistic portrayal of these species' interactions within the ecosystem, based on their diet, habitat, and zone attributes.

Below is a detailed breakdown of this procedure:

1. **Link Assignment**: Starting with the dataset titled "species_for_randomized_link_assignment", we examine each row to extract the species (denoted as a 'taxon'), its diet, diet rank, habitat, and zone.

2. **Linking to Compatible Species**: We then consult a secondary dataset, the full node list, to identify species that could interact with each fake basal species. The diet rank from the first dataset names a column in the node list; we select the rows whose value in that column equals the species' diet and whose habitat and zone both match. Each selected row becomes a candidate link from the fake basal species, representing a possible interaction based on diet and shared habitat and zone.

3. **Removing Duplicate Links**: The previous step can produce the same link more than once. To address this, we pool all candidate links and drop duplicate source-target pairs.

4. **Sampling Links**: For each fake basal species, we retain only a random subset of its candidate links. The size of the subset depends on the species' dietary range, as described below.

5. **Integration into the MetaWeb**: Finally, these selected links are added to our overarching ecosystem network, referred to as the 'MetaWeb'.
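
Steps 2 and 3 can be illustrated on toy records without pandas. The sketch below is only an illustration under stated assumptions: the taxa, the `Genus` diet rank, and the helper name `candidate_links` are hypothetical, and each record is assumed to carry a single habitat and a single zone.

```python
# Toy node list: one record per (taxon, habitat, zone) combination.
# All taxa and attribute values are made up for illustration.
nodes = [
    {"Taxon": "Mytilus edulis", "Genus": "Mytilus", "Habitat": "marine", "Zone": "benthic"},
    {"Taxon": "Mytilus trossulus", "Genus": "Mytilus", "Habitat": "marine", "Zone": "benthic"},
    {"Taxon": "Mytilus edulis", "Genus": "Mytilus", "Habitat": "brackish", "Zone": "benthic"},
    {"Taxon": "Gadus morhua", "Genus": "Gadus", "Habitat": "marine", "Zone": "pelagic"},
]

# One fake basal species listed once per habitat it occupies.
fake_basal = [
    {"Taxon": "Asterias rubens", "Diet_Rank": "Genus", "Diet_Name": "Mytilus",
     "Habitat": "marine", "Zone": "benthic"},
    {"Taxon": "Asterias rubens", "Diet_Rank": "Genus", "Diet_Name": "Mytilus",
     "Habitat": "brackish", "Zone": "benthic"},
]

def candidate_links(nodes, fake_basal):
    """Return unique (source, target) pairs in first-seen order."""
    links, seen = [], set()
    for species in fake_basal:
        for node in nodes:
            # The diet rank names the column to compare against the diet name.
            matches_diet = node.get(species["Diet_Rank"]) == species["Diet_Name"]
            same_place = (node["Habitat"] == species["Habitat"]
                          and node["Zone"] == species["Zone"])
            pair = (species["Taxon"], node["Taxon"])
            if matches_diet and same_place and pair not in seen:
                seen.add(pair)
                links.append(pair)
    return links

links = candidate_links(nodes, fake_basal)
print(links)
```

The brackish record for the first mussel matches a second time, but the pair is already known, so only two links remain.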

The number of links we retain varies depending on the species' dietary range:

- For species with a **Generalised** diet, we retain 5% of the potential interaction links (rounded up).
- For species with a **Specialised** diet, we randomly select between 1 and 5 potential interaction links.

This ensures that our network adequately represents the varying interaction potential of species based on their diet specialisation.
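
As a rough check of the retention rule, the sketch below computes how many links would be kept for a given number of candidates. The function name `links_to_keep` and its `rng` parameter are hypothetical, and the cap at the number of available candidates for specialists is an assumption taken from the code cell that follows.

```python
import math
import random

def links_to_keep(n_candidates, diet_range, rng=random):
    """Number of candidate links retained for one fake basal species."""
    if diet_range == "Generalised":
        # 5% of the candidates, rounded up.
        return math.ceil(0.05 * n_candidates)
    # Any other value is treated as 'Specialised': 1 to 5 links,
    # but never more than the number of candidates available.
    return min(n_candidates, rng.randint(1, 5))

generalist_counts = {n: links_to_keep(n, "Generalised") for n in (1, 20, 21, 100)}
print(generalist_counts)  # {1: 1, 20: 1, 21: 2, 100: 5}

rng = random.Random(0)  # seeded only to make the illustration repeatable
specialist_count = links_to_keep(3, "Specialised", rng)
print(specialist_count)  # somewhere between 1 and 3
```

Because of the rounding up, a generalist with at least one candidate always keeps at least one link.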

In [None]:
import numpy as np
import random
import pandas as pd

In [None]:
def load_dataframe_from_csv(path):
    return pd.read_csv(path)

def explode_multivalue_columns(df, columns_to_explode):
    for column in columns_to_explode:
        df[column] = df[column].str.split('; ')
        df = df.explode(column)
    return df

def filter_subdiet_by_habitat_and_zone(nodes_df, species_info):
    sub_diet = nodes_df[nodes_df[species_info['Diet_Rank']] == species_info['Diet_Name']]
    return sub_diet[(sub_diet['Habitat'] == species_info['Habitat']) & (sub_diet['Zone'] == species_info['Zone'])]

def construct_linkage_dataframe(species_info, sub_diet_filtered):
    linkage_dataframe = pd.DataFrame(columns=['Diet_Range', 'Source', 'Target'])
    number_of_rows = len(sub_diet_filtered)
    linkage_dataframe['Source'] = np.full((number_of_rows), species_info['Taxon'])
    linkage_dataframe['Diet_Range'] = np.full((number_of_rows), species_info['Diet_Range'])
    linkage_dataframe['Target'] = sub_diet_filtered['Taxon'].values
    return linkage_dataframe

def sample_rows_based_on_diet_range(species_group):
    # Generalists keep a random 5% of their candidate links (rounded up).
    if species_group['Diet_Range'].iloc[0] == 'Generalised':
        return species_group.sample(int(np.ceil(0.05 * len(species_group))))
    # Any other category is assumed to be 'Specialised': keep 1 to 5 random links.
    return species_group.sample(min(len(species_group), random.randint(1, 5)))

def generate_links(nodes_df, random_species_links_df):
    # Build the candidate links for each fake basal species, then concatenate once.
    link_frames = []
    for _, species_info in random_species_links_df.iterrows():
        sub_diet_filtered = filter_subdiet_by_habitat_and_zone(nodes_df, species_info)
        link_frames.append(construct_linkage_dataframe(species_info, sub_diet_filtered))
    all_possible_links = pd.concat(link_frames, ignore_index=True)

    # The same Source-Target pair can arise from several habitat/zone rows.
    all_possible_links = all_possible_links.drop_duplicates(subset=['Source', 'Target'])
    return all_possible_links.groupby('Source').apply(sample_rows_based_on_diet_range).reset_index(drop=True)

node_list_dataframe = load_dataframe_from_csv('../node_lists/all_species_and_feeding_groups.csv')
node_list_dataframe = explode_multivalue_columns(node_list_dataframe, ['Habitat', 'Zone'])

random_species_links_dataframe = load_dataframe_from_csv('../node_lists/species_for_randomized_links.csv')
random_species_links_dataframe = explode_multivalue_columns(random_species_links_dataframe, ['Habitat', 'Zone'])

sampled_links = generate_links(node_list_dataframe, random_species_links_dataframe)

In [None]:
# Add the sampled links to the MetaWeb.
# G is assumed to be the existing MetaWeb, a networkx.DiGraph built earlier in the workflow.
for edge in sampled_links.itertuples(index=False):
    G.add_edge(edge.Source, edge.Target)

## Pruning the Network Based on Optimal Foraging Theory

Because the dataset has been expanded, the food web is in fact a MetaWeb: it contains every potential feeding link rather than only the realized ones.
Therefore, before performing a perturbation on the network, we remove some links.

Here's the step-by-step process:

1. **Classify generalists and specialists**: A threshold value of in-degree, `k`, is used for this classification. If a species (node in the network) has an in-degree greater than or equal to `k`, it is considered a generalist; otherwise, it is considered a specialist.

2. **Preserve a fraction of the links for generalists**: For each generalist, we randomly keep only 10% of its inward edges (links from other species to this one, signifying a predation relationship), effectively removing 90% of them. 

In [None]:
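def remove_links(G, k_threshold, percentage_to_remove=0.9):
    # Note: G is modified in place and also returned.
    nodes = list(G.nodes)
    for node in nodes:
        in_degree = G.in_degree(node)

        # Generalists are nodes whose in-degree is at least k_threshold.
        if in_degree >= k_threshold:
            inward_edges = list(G.in_edges(node))
            # int() rounds down, so slightly more than the nominal fraction may be kept.
            num_edges_to_remove = int(in_degree * percentage_to_remove)
            edges_to_remove = random.sample(inward_edges, num_edges_to_remove)
            G.remove_edges_from(edges_to_remove)
    return G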

# usage
threshold_k = 10
G = remove_links(G, threshold_k)
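
To see the effect of the threshold on a small case, the sketch below applies the same rule to a plain dictionary mapping each node to the sources of its inward edges. It is an illustration only: it does not use networkx, and the name `prune_in_edges`, the toy node names, and the `rng` parameter are hypothetical.

```python
import random

def prune_in_edges(in_edges, k, fraction_to_remove=0.9, rng=random):
    """Return a copy of in_edges with generalists' inward edges thinned out."""
    pruned = {}
    for node, sources in in_edges.items():
        if len(sources) >= k:
            # Generalist: remove a random 90% of inward edges (rounded down).
            n_remove = int(len(sources) * fraction_to_remove)
            removed = set(rng.sample(sources, n_remove))
            pruned[node] = [s for s in sources if s not in removed]
        else:
            # Specialist: all inward edges are kept.
            pruned[node] = list(sources)
    return pruned

toy = {
    "generalist": [f"prey_{i}" for i in range(20)],   # in-degree 20
    "specialist": ["prey_0", "prey_1", "prey_2"],     # in-degree 3
}
result = prune_in_edges(toy, k=10, rng=random.Random(0))
print({node: len(sources) for node, sources in result.items()})
# {'generalist': 2, 'specialist': 3}
```

With `k = 10`, the node with 20 inward edges keeps 2 of them, while the node with 3 inward edges is left untouched.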