# Part 3

----------

### Exploring and identifying significant differences in the distribution of drug modalities across subcellular locations

In [1]:
# This cell cleans the 'subcellularLocations' column by stripping bracketed prefixes with a regular expression,
# then tallies how often each drug type occurs in each subcellular location.
# (I simplified the subcellular locations because the raw data is highly inconsistent. The grouping may need to be refined in the future.)
# Finally, we save the result as a CSV file: the tally is a small two-dimensional table (2,3000), so Parquet files aren't necessary.

import pandas as pd
import ast  # for parsing the stringified lists with literal_eval
import re

final_df = pd.read_parquet('./data/FINAL_DF.parquet')

# Parse the stringified lists into Python lists (their elements are treated as strings below)
final_df['subcellularLocations'] = final_df['subcellularLocations'].apply(ast.literal_eval)

# Use the re module to strip bracketed prefixes of the form '[...]: ' so that locations group cleanly
def remove_brackets(text):
    return re.sub(r'\[.*?\]: ', '', text)

# Apply remove_brackets to every entry in each row's location list (another simplification step that makes the group-by possible)
final_df['subcellularLocations'] = final_df['subcellularLocations'].apply(lambda x: [remove_brackets(location) for location in x])

# This mapping is an important step: it reduces the many distinct subcellular locations to a manageable set of groups.
# Without it, comparing where the different drug types act would be much harder.
subcellular_mapping = {
    'Actin filaments': ['Actin filaments'],
    'Apical cell membrane': ['Apical cell membrane', 'Basal cell membrane', 'Basolateral cell membrane', 'Lateral cell membrane'],
    'Cell Junctions': ['Cell Junctions', 'Cell junction'],
    'Cell membrane': ['Cell membrane', 'Plasma membrane', 'Cell surface', 'Postsynaptic cell membrane', 'Presynaptic cell membrane'],
    'Centrosome': ['Centriolar satellite', 'Centrosome'],
    'Cytoskeleton': ['Intermediate filaments', 'Microtubules'],
    'Cytoplasm': ['Cytoplasm', 'Cytoplasmic bodies', 'Cytoplasmic vesicle', 'Cytosol'],
    'Endomembrane system': ['Endomembrane system', 'Endoplasmic reticulum', 'Endoplasmic reticulum membrane', 'Endoplasmic reticulum lumen', 'Golgi apparatus', 'Golgi apparatus membrane', 'Golgi outpost', 'Lysosome', 'Lysosome membrane', 'Lysosome lumen', 'Microsome', 'Microsome membrane', 'Recycling endosome', 'Recycling endosome membrane'],
    'Endosome': ['Early endosome', 'Endosome', 'Endosome membrane', 'Endosome lumen', 'Late endosome', 'Late endosome membrane'],
    'Mitochondria': ['Mitochondria', 'Mitochondrion', 'Mitochondrion inner membrane', 'Mitochondrion matrix', 'Mitochondrion membrane', 'Mitochondrion outer membrane'],
    'Nucleus': ['Nuclear bodies', 'Nuclear membrane', 'Nuclear speckles', 'Nucleoli', 'Nucleoli fibrillar center', 'Nucleoli rim', 'Nucleoplasm', 'Nucleus', 'Perikaryon'],
    'Peroxisome': ['Peroxisome', 'Peroxisome membrane', 'Peroxisome matrix'],
    'Secretion': ['Extracellular vesicle', 'Predicted to be secreted', 'Secreted'],
    'Vesicles': ['Vesicles'],
    'Synapse': ['Postsynapse', 'Synapse', 'Postsynaptic density'],
    'Others': ['Chromosome', 'Cytokinetic bridge', 'Focal adhesion sites', 'Lipid droplet', 'Melanosome', 'Melanosome membrane', 'Mitotic spindle', 'Photoreceptor inner segment', 'Rods & Rings', 'Rough endoplasmic reticulum', 'Sarcoplasmic reticulum membrane']
}

# Look up the group a location belongs to; locations missing from the mapping are returned unchanged
def map_subcellular_location(location):
    for group, locations in subcellular_mapping.items():
        if location in locations:
            return group
    return location

# map subcellular locations to their groups
final_df['subcellularLocations'] = final_df['subcellularLocations'].apply(lambda x: [map_subcellular_location(location) for location in x])

tally_df = final_df.explode('subcellularLocations')

# Count occurrences of each (drugType, location) pair and store them in a 'count' column
tally_df = tally_df.groupby(['drugType', 'subcellularLocations']).size().reset_index(name='count')

print(tally_df.head(100))

tally_df.to_csv('3_Cleaned_Data_Grouped.csv', index=False)


          drugType  subcellularLocations  count
0         Antibody       Actin filaments     27
1         Antibody  Apical cell membrane     45
2         Antibody        Cell Junctions     56
3         Antibody         Cell membrane    769
4         Antibody       Cell projection     49
..             ...                   ...    ...
95  Small molecule       Cell projection    243
96  Small molecule            Centrosome     48
97  Small molecule       Cleavage furrow      1
98  Small molecule             Cytoplasm   1915
99  Small molecule   Cytoplasmic granule      4

[100 rows x 3 columns]
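
The output above includes labels such as `Cell projection` and `Cleavage furrow` that are not group names in `subcellular_mapping`: any location the mapping does not list is passed through unchanged. The sketch below shows one way to make that visible. It inverts the mapping into a flat lookup and lists the labels that no group claims. The shortened mapping and the raw labels (including the bracketed prefix) are made up for illustration, and `label_to_group` is a hypothetical helper name, not part of the original notebook.

```python
import re

# Hypothetical, shortened version of the mapping used above (illustration only).
subcellular_mapping = {
    'Cell membrane': ['Cell membrane', 'Plasma membrane', 'Cell surface'],
    'Nucleus': ['Nucleoplasm', 'Nucleus', 'Nucleoli'],
}

# Invert the mapping once: raw label -> group name.
label_to_group = {
    label: group
    for group, labels in subcellular_mapping.items()
    for label in labels
}

def remove_brackets(text):
    """Strip a '[...]: ' prefix, using the same pattern as the cell above."""
    return re.sub(r'\[.*?\]: ', '', text)

def map_subcellular_location(location):
    """Return the group for a label, or the label itself if it is unmapped."""
    return label_to_group.get(location, location)

# Made-up raw labels; the real prefix format in the data may differ.
raw = ['[Example source]: Plasma membrane', 'Nucleoplasm', 'Cell projection']
cleaned = [remove_brackets(label) for label in raw]
grouped = [map_subcellular_location(label) for label in cleaned]

# Labels that no group claims pass through unchanged, so list them for review.
group_names = set(subcellular_mapping)
unmapped = sorted({label for label in grouped if label not in group_names})

print(grouped)
print(unmapped)
```

Reviewing the `unmapped` list on the real data would show which raw labels still need a home in the mapping before the tallies are compared.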


In [38]:
# Basic summary statistics of the tally: counts per drug type and the subcellular locations each drug type acts in
data_df = pd.read_csv('3_Cleaned_Data_Grouped.csv')

summary_stats = data_df.groupby('drugType')['count'].describe()

total_counts = data_df.groupby('drugType')['count'].sum()

# top subcellular locations for each drug type
top_locations = data_df.groupby('drugType').apply(lambda x: x.nlargest(3, 'count'))

print("Summary Statistics:")
print(summary_stats)
print("\nTotal Counts by Drug Type:")
print(total_counts)
print("\nTop Subcellular Locations for Each Drug Type:")
print(top_locations)

Summary Statistics:
                 count        mean         std  min   25%   50%     75%  \
drugType                                                                  
Antibody          27.0   96.629630  184.139819  1.0  1.00  21.0   94.50   
Cell               8.0    1.875000    0.991031  1.0  1.00   1.5    3.00   
Enzyme             5.0    9.600000   13.557286  1.0  1.00   1.0   13.00   
Gene               9.0    3.888889    2.472066  1.0  2.00   3.0    6.00   
Oligonucleotide   17.0   18.176471   23.922362  1.0  2.00   5.0   27.00   
Oligosaccharide    6.0    6.666667    9.352362  1.0  1.25   2.5    6.75   
Protein           19.0   46.210526   77.019174  1.0  3.00  12.0   59.50   
Small molecule    36.0  259.416667  534.048760  1.0  5.75  34.5  201.75   
Unknown           19.0   23.789474   32.425263  1.0  3.00   7.0   35.00   

                    max  
drugType                 
Antibody          769.0  
Cell                3.0  
Enzyme             32.0  
Gene                8.0 



-----------------
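
The top-three selection in the summary cell uses `groupby('drugType').apply(...)`, which recent pandas versions may warn about when the function also touches the grouping column. A minimal sketch of an alternative, assuming the same three columns as `3_Cleaned_Data_Grouped.csv`, is to sort first and then take the head of each group. The counts below are made up for illustration.

```python
import pandas as pd

# Made-up tally in the same shape as 3_Cleaned_Data_Grouped.csv (values are illustrative).
data_df = pd.DataFrame({
    'drugType': ['Antibody'] * 4 + ['Enzyme'] * 2,
    'subcellularLocations': ['Cell membrane', 'Secretion', 'Cytoplasm', 'Nucleus',
                             'Cytoplasm', 'Secretion'],
    'count': [50, 30, 20, 10, 7, 3],
})

# Sort by drug type, then by descending count, and keep the first 3 rows per drug type.
top_locations = (
    data_df.sort_values(['drugType', 'count'], ascending=[True, False])
           .groupby('drugType')
           .head(3)
           .reset_index(drop=True)
)

print(top_locations)
```

This returns a flat DataFrame with the original columns rather than the nested index that `apply` produces, which tends to be easier to print or save.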

### Before visualising the data, we can run a statistical test for significant differences in the distribution of drug modalities (drug types) across subcellular locations

---------------------

In [2]:
import pandas as pd
from scipy.stats import chi2_contingency

# Read the CSV into a DataFrame
tally_df = pd.read_csv('3_Cleaned_Data_Grouped.csv')

# Create a contingency table (note: normalize='index' converts each row to proportions rather than raw counts)
contingency_table = pd.crosstab(tally_df['drugType'], tally_df['subcellularLocations'], values=tally_df['count'], aggfunc='sum', normalize='index')

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)


Chi-square statistic: 6.4985407558099615
P-value: 1.0
Degrees of freedom: 304


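As run, the chi-square test reports no significant association between drug type and the subcellular location in which a drug operates (p = 1.0). This result should be read with caution: the contingency table was built with `normalize='index'`, so the test was applied to row proportions rather than the observed counts that a chi-square test of independence requires. We will carry out some visual inspections to explore further.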
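
For comparison, here is a minimal sketch of the same test run on raw counts. `chi2_contingency` expects observed frequencies, so the table is built without normalisation and missing drug type/location pairs are filled with zero. The numbers are made up for illustration and say nothing about the real data; the threshold of 5 for expected counts is a common rule of thumb, not something taken from this analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up tally in the same shape as 3_Cleaned_Data_Grouped.csv (values are illustrative).
tally_df = pd.DataFrame({
    'drugType': ['Antibody', 'Antibody', 'Small molecule', 'Small molecule', 'Protein'],
    'subcellularLocations': ['Cell membrane', 'Cytoplasm', 'Cell membrane', 'Cytoplasm',
                             'Cell membrane'],
    'count': [80, 20, 40, 60, 30],
})

# Raw-count contingency table; absent combinations become 0 instead of NaN.
counts = tally_df.pivot_table(
    index='drugType',
    columns='subcellularLocations',
    values='count',
    aggfunc='sum',
    fill_value=0,
)

chi2, p, dof, expected = chi2_contingency(counts)

print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)

# Rule-of-thumb check: many small expected counts make the approximation unreliable.
print("Cells with expected count < 5:", int((expected < 5).sum()))
```

On the real tally, many cells have very small counts, so it may be worth merging rare drug types or locations before interpreting the result.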

### [Click here to go to PART 4 - visualising the analysis results](.\4_Visualise_Analysis_Results.ipynb)